Background

To allow comparison of gene expression with a metadata variable of interest, BITHub contains comprehensive metadata annotations of the curated datasets. Three main categories of annotation are present in BITHub: sample characteristics, sequencing metrics and phenotype.

In order to ensure the metadata information is displayed in a user-friendly manner, highly correlated metadata annotations will be removed and a subset will be used for the site.

Set-up

Metadata correlations - Bulk datasets

Metadata correlations were computed after processing the raw metadata and expression files. If you are interested in that part of the pipeline, please refer to the README.md in the GitHub repo.

BrainSeq

Prior to running the cor() function, the FQCbasicStats, perSeqQual, SeqLengthDist and KmerContent columns were removed: each contained a single repeated value, so its correlations were undefined and returned as NA.
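The reason constant columns produce NAs is that Pearson correlation divides by each column's standard deviation, which is zero when every value is identical. The pipeline itself is written in R; the following is only a minimal Python sketch of the same pre-filtering idea. The function name `drop_constant_columns` and the toy values are hypothetical and are not taken from the BITHub code.

```python
def drop_constant_columns(table):
    """Return a copy of `table` without columns whose observed values are all identical.

    `table` maps column name -> list of values; None marks a missing value.
    """
    kept = {}
    for name, values in table.items():
        observed = {v for v in values if v is not None}
        if len(observed) > 1:  # at least two distinct values, so variance is non-zero
            kept[name] = values
    return kept


# Toy metadata (hypothetical values): KmerContent is constant across samples.
metadata = {
    "RIN": [7.1, 8.3, 6.9],
    "mitoRate": [0.02, 0.05, 0.03],
    "KmerContent": [1, 1, 1],
}
filtered = drop_constant_columns(metadata)
print(list(filtered))
```

Only the columns that vary across samples remain, so every pairwise correlation computed afterwards is defined.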

BrainSeq metadata annotations show duplicate information in many columns (e.g. SampleID, SAMPLEID), which is likely a result of running the pre-processing pipeline for BITHub. Additionally, certain columns contain very similar information, resulting in high correlation. Several RNA-seq QC metrics also provide redundant information and will therefore be removed for downstream analysis.

*Correlation plot of metadata annotations from BrainSeq phase II. The metadata annotations are clustered based on correlation*

The final BrainSeq metadata annotations will contain the following columns:

| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| X | SampleID | Sample characteristics |
| trimmed | trimmed | Sequencing metrics |
| numReads | TotalNReads | Sequencing metrics |
| numMapped | numMapped | Sequencing metrics |
| numUnmapped | numUnmapped | Sequencing metrics |
| overallMapRate | MappingRate | Sequencing metrics |
| concordMapRate | concordMapRate | Sequencing metrics |
| totalMapped | totalMapped | Sequencing metrics |
| mitoMapped | mitoMapped | Sequencing metrics |
| mitoRate | mito_Rate | Sequencing metrics |
| totalAssignedGene | totalAssignedGene | Sequencing metrics |
| rRNA_rate | rRNA_rate | Sequencing metrics |
| RNum | SampleID | Phenotype |
| Region | StructureAcronym | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| Age | AgeNumeric | Phenotype |
| Sex | Sex | Phenotype |
| Race | Ethnicity | Phenotype |
| Dx | Diagnosis | Phenotype |
| Fetal_replicating | Dev.Replicating | Sample characteristics |
| Fetal_quiescent | Dev.Quiescent | Sample characteristics |
| OPC | Adult.OPC | Sample characteristics |
| Neurons | Adult.Neurons | Sample characteristics |
| Astrocytes | Adult.Astrocytes | Sample characteristics |
| Oligodendrocytes | Adult.Oligo | Sample characteristics |
| Microglia | Adult.Microglia | Sample characteristics |
| Endothelial | Adult.Endothelial | Sample characteristics |
| NA | AgeInterval | Phenotype |
| NA | Period | Phenotype |
| NA | Regions | Sample characteristics |
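A mapping like the one in the table above can be applied as a simple column rename. The sketch below is only an illustration in Python (the pipeline itself is in R): the dictionary holds a handful of pairs copied from the table, while the function name `rename_columns` and the sample record are hypothetical.

```python
# A few original -> BITHub column names, copied from the BrainSeq table.
BRAINSEQ_TO_BIT = {
    "numReads": "TotalNReads",
    "overallMapRate": "MappingRate",
    "mitoRate": "mito_Rate",
    "Region": "StructureAcronym",
    "Age": "AgeNumeric",
    "Dx": "Diagnosis",
}


def rename_columns(record, mapping):
    """Rename keys found in `mapping`; keys without an entry keep their original name."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Toy sample record (hypothetical values).
sample = {"numReads": 1000, "Region": "DLPFC", "Age": 42.0, "Sex": "F"}
renamed = rename_columns(sample, BRAINSEQ_TO_BIT)
print(renamed)
```

Columns whose name is unchanged in the table (such as Sex) pass through untouched. Rows listed with NA as the original name have no source column to rename, so they are not covered by this sketch.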

BrainSpan

BrainSpan metadata annotations contain several duplicate or redundant columns that carry essentially the same information (e.g. column_num, Age.x, Braincode). BrainSpan annotations were retrieved from multiple sources, which may have led to the same annotation appearing under different column names.

*Correlation plot of metadata annotations from BrainSpan. The metadata annotations are clustered based on correlation*

The following BrainSpan metadata annotations will be used for BITHub:

| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| SampleID | SampleID | Sample characteristics |
| gender | Sex | Phenotype |
| structure_acronym | StructureAcronym | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeNumeric | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Diagnosis | Phenotype |
| NA | Regions | Sample characteristics |
| NA | mRIN | Sequencing metrics |
| Hemisphere | Hemisphere | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| PMI | PMI | Sequencing metrics |
| pH | pH | Sequencing metrics |
| Ethnicity | Ethnicity | Phenotype |

GTEx

The GTEx metadata contains comprehensive annotations of sample, sequencing and phenotype attributes. However, redundant and strongly correlated annotations, particularly for sequencing metrics, will be removed.

*Correlation plot of metadata annotations of brain samples from GTEx. The correlations are plotted as numeric values to allow a better overview of the data.*

The above figure shows NA values on the correlation plot. These arise either from missing values in certain columns or from columns with no variation in their values, for which a correlation cannot be computed. Annotations in the same category that are highly correlated (e.g. ReadsMapped and TotalNReads) will be removed. Additionally, annotations with too many missing values will also be removed, as these would impact the readability of the plots on BITHub.
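The two pruning criteria described here (high pairwise correlation and too many missing values) can be sketched in a few lines. This is a hedged Python illustration rather than the R code used for BITHub: the 0.9 correlation threshold is an assumption, since the text does not state the cut-off that was used, and the function names and toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation over pairwise-complete observations; None when undefined."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None  # too few complete pairs
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # no variation in one column, shown as NA on the plot
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    return sxy / sqrt(sxx * syy)


def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def redundant_pairs(table, threshold=0.9):
    """List column pairs whose absolute correlation reaches the (assumed) threshold."""
    flagged = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if r is not None and abs(r) >= threshold:
            flagged.append((a, b))
    return flagged


# Toy values (hypothetical): read counts in millions, plus one sparse column.
table = {
    "TotalNReads": [50.0, 60.0, 70.0, 80.0],
    "ReadsMapped": [45.0, 55.0, 62.0, 73.0],
    "RIN": [7.0, 6.0, 8.5, 6.5],
    "SparseColumn": [None, None, None, 1.0],
}
print(redundant_pairs(table))
print(missing_fraction(table["SparseColumn"]))
```

In this toy table only the TotalNReads and ReadsMapped pair is flagged, and one of the two would be dropped. The sparse column yields no defined correlation and has three quarters of its values missing, so it would be a candidate for removal under the second criterion.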

The following metadata annotations will be used for GTEx:

| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| SAMPID | SampleID | Sample characteristics |
| SMRIN | RIN | Sequencing metrics |
| SMTSISCH | PMI | Sequencing metrics |
| AGE | AgeInterval | Phenotype |
| SEX | Sex | Phenotype |
| SMATSSCR | AutolysisScore | Sample characteristics |
| SMNABTCH | IsolationBatchID | Sample characteristics |
| SMNABTCHT | TypeofBatch | Sample characteristics |
| SMNABTCHD | DateofBatch | Sample characteristics |
| SMGEBTCH | Genotype_or_Expression_Batch_ID | Sample characteristics |
| SMGEBTCHD | DateofGenotypeorExpressionBatch | Sample characteristics |
| SMGEBTCHT | TypeofGenotypeorExpressionBatch | Sample characteristics |
| SMCENTER | BSS_Collection_side_code | Sample characteristics |
| SMTS | Regions | Sample characteristics |
| SMTSD | StructureAcronym | Sample characteristics |
| SMTSPAX | Time_spent_in_PAXgene_fixative | Sequencing metrics |
| SME2MPRT | End_2_mapping_rate | Sequencing metrics |
| SMCHMPRS | ChimericPairs | Sequencing metrics |
| SMNTRART | IntragenicRate | Sequencing metrics |
| SMNUMGPS | No_of_Gaps | Sequencing metrics |
| SMMAPRT | MappingRate_total | Sequencing metrics |
| SMEXNCRT | ExonicRate | Sequencing metrics |
| SM550NRM | BasedNormalised | Sequencing metrics |
| SMGNSDTC | GenesDetected | Sequencing metrics |
| SMUNMPRT | Rate_of_mapped_genes_unique | Sequencing metrics |
| SM350NRM | BaseNormilization | Sequencing metrics |
| SMESTLBS | LibrarySize | Sequencing metrics |
| SMMPPD | ReadsMapped | Sequencing metrics |
| SMNTERRT | IntergenicRate | Sequencing metrics |
| SMRRNANM | rRNA | Sequencing metrics |
| SMRDTTL | TotalNReads | Sequencing metrics |
| SMMNCV | Mean_Coeff_Variation | Sequencing metrics |
| SMTRSCPT | TranscriptsDetected | Sequencing metrics |
| SMMPPDPR | MappedPairs | Sequencing metrics |
| SMUNPDRD | UnpairedReads | Sequencing metrics |
| SMNTRNRT | IntronicRate | Sequencing metrics |
| SMMPUNRT | Mapped_unique_rate_of_total | Sequencing metrics |
| SMEXPEFF | ExpressionProfilingEfficiency | Sequencing metrics |
| SMMPPDUN | MappedUnique_no_dup_flags | Sequencing metrics |
| SME2MMRT | End_2_Mismatch_Rate | Sequencing metrics |
| SME2ANTI | End_2_Antisense | Sequencing metrics |
| SME2SNSE | End_Sense_2 | Sequencing metrics |
| SME1ANTI | End_1_Antisense | Sequencing metrics |
| SME1SNSE | End_1_Sense | Sequencing metrics |
| SME1PCTS | End_1_Sense_percentage | Sequencing metrics |
| SMRRNART | rRNA_rate | Sequencing metrics |
| SME1MPRT | End_1_Mapping_rate | Sequencing metrics |
| SMNUM5CD | Num_of_Reads_Covered_5prime | Sequencing metrics |
| SMDPMPRT | DuplicationRateMapped | Sequencing metrics |
| SME2PCTS | Percentage_IntragenicEnd_2_Reads | Sequencing metrics |
| DTHHRDY | HardyScale | Phenotype |

PsychEncode

PsychEncode metadata annotations contain limited information on sequencing metrics. Additionally, some metadata annotations carry similar information and are therefore highly correlated. These include Row_IDs, Row_Versions, Contributing Studies and Notes. These columns will be removed for BITHub.

*Correlation plot of metadata annotations from PsychEncode. Due to missing values in the correlation matrix, the annotations could not be clustered*

The following metadata annotations will be retained for PsychEncode:

| OriginalColumnName | BITColumnName | Type |
|---|---|---|
| individualID | SampleID | Sample characteristics |
| diagnosis | Diagnosis | Phenotype |
| sex | Sex | Phenotype |
| ethnicity | Ethnicity | Phenotype |
| ageDeath | AgeNumeric | Phenotype |
| Adult.Ex1 | Adult.Ex1 | Sample characteristics |
| Adult.Ex2 | Adult.Ex2 | Sample characteristics |
| Adult.Ex3 | Adult.Ex3 | Sample characteristics |
| Adult.Ex4 | Adult.Ex4 | Sample characteristics |
| Adult.Ex5 | Adult.Ex5 | Sample characteristics |
| Adult.Ex6 | Adult.Ex6 | Sample characteristics |
| Adult.Ex7 | Adult.Ex7 | Sample characteristics |
| Adult.Ex8 | Adult.Ex8 | Sample characteristics |
| Adult.In1 | Adult.In1 | Sample characteristics |
| Adult.In2 | Adult.In2 | Sample characteristics |
| Adult.In3 | Adult.In3 | Sample characteristics |
| Adult.In4 | Adult.In4 | Sample characteristics |
| Adult.In5 | Adult.In5 | Sample characteristics |
| Adult.In6 | Adult.In6 | Sample characteristics |
| Adult.In7 | Adult.In7 | Sample characteristics |
| Adult.In8 | Adult.In8 | Sample characteristics |
| Adult.Astrocytes | Adult.Astrocytes | Sample characteristics |
| Adult.Endothelial | Adult.Endothelial | Sample characteristics |
| Dev.Quiescent | Dev.Quiescent | Sample characteristics |
| Dev.Replicating | Dev.Replicating | Sample characteristics |
| Adult.Microglia | Adult.Microglia | Sample characteristics |
| Adult.OtherNeuron | Adult.OtherNeuron | Sample characteristics |
| Adult.OPC | Adult.OPC | Sample characteristics |
| Adult.Oligo | Adult.Oligo | Sample characteristics |
| structure_acronym | StructureAcronym | Sample characteristics |
| ageOnset | ageOnset | Phenotype |
| causeDeath | causeDeath | Phenotype |
| brainWeight | brainWeight | Phenotype |
| height | height | Phenotype |
| weight | weight | Phenotype |
| ageBiopsy | ageBiopsy | Sample characteristics |
| smellTestScore | smellTestScore | Sample characteristics |
| smoker | smoker | Sample characteristics |
| Capstone_4 | Capstone_4 | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Regions | Sample characteristics |

Determining drivers of variation

A fundamental challenge in the analysis of complex RNA-seq datasets is determining the impact of sources of variation and their relationship with gene expression. To identify these impacts, we used variancePartition, a Bioconductor package that uses a mixed linear model to estimate the proportion of variance explained by selected covariates. Currently, variancePartition has only been applied to the bulk RNA-seq datasets.
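To make "proportion of variance explained" concrete, the sketch below computes it for a single gene and a single categorical covariate as the between-group sum of squares divided by the total sum of squares. This is a deliberately simplified stand-in written in Python: it is a one-way decomposition, not the mixed linear model that variancePartition fits, and the function name and toy values are hypothetical.

```python
from collections import defaultdict


def fraction_variance_explained(values, groups):
    """Between-group sum of squares divided by total sum of squares for one gene."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return None  # constant expression: there is no variance to explain
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return between_ss / total_ss


# Toy expression of one gene in four samples (hypothetical values).
expression = [2.0, 4.0, 6.0, 8.0]
region = ["A", "A", "B", "B"]
sex = ["F", "M", "F", "M"]

print(fraction_variance_explained(expression, region))
print(fraction_variance_explained(expression, sex))
```

In this balanced toy design the two covariates are independent of each other, so their fractions add up to the whole variance. With correlated covariates that clean split no longer holds, which is why covariate selection matters before fitting the model.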

In the variancePartition workflow, the contribution of highly correlated covariates is divided between them, which results in a smaller share of the variation being attributed to each of these covariates. To ensure we select covariates that provide the most useful information about the data, we will also perform a canonical correlation analysis (CCA), which assesses the degree to which variables co-vary and contain the same information. This confirms that the variables selected on the basis of the above correlations are indeed those that provide the most valuable insight into the data.

BrainSeq

Prior to running the variancePartition pipeline, we remove lowly expressed genes, as they would skew the downstream analysis. We use a generous expression cut-off of 1 RPKM in at least 10% of all BrainSeq samples. This reduces the number of genes in the expression matrix from 58,037 to 20,452.
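The filtering rule can be written as a short function. The following is a minimal Python sketch of the rule as stated, not the R code used in the pipeline: whether the comparisons are inclusive (`>=`) is an assumption, and the function name and toy expression values are hypothetical.

```python
def filter_lowly_expressed(expression, cutoff=1.0, min_fraction=0.10):
    """Keep genes whose expression reaches `cutoff` in at least `min_fraction` of samples.

    `expression` maps gene name -> list of per-sample values (e.g. RPKM).
    """
    kept = {}
    for gene, values in expression.items():
        n_expressed = sum(v >= cutoff for v in values)  # samples passing the cut-off
        if n_expressed / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Toy matrix with 10 samples per gene (hypothetical values).
expression = {
    "geneA": [5.2, 3.1, 0.0, 4.4, 2.2, 6.0, 1.5, 0.9, 3.3, 2.8],  # expressed in most samples
    "geneB": [0.0, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.3, 0.0],  # never reaches the cut-off
    "geneC": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2],  # passes in exactly 1 of 10
}
kept = filter_lowly_expressed(expression)
print(list(kept))
```

Here geneC is retained because one sample in ten is exactly 10%, which shows how permissive this threshold is.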

*Density of genes in BrainSeq with respect to expression before and after filtering for lowly expressed genes*

For the BrainSeq Phase II dataset, we want to ensure that the metadata variables we select are the most indicative of the contribution of variation from each category. We will use AgeNumeric, RIN, mito_Rate, rRNA_rate, TotalNReads, MappingRate, StructureAcronym, Sex and Diagnosis. All of these selected attributes reflect different aspects of the dataset, and it will be useful for the user to see which of these factors is driving the expression of their gene of interest.

*Assessing correlation between covariates of interest from the BrainSeq data*

The correlation plot reveals that many of the selected covariates do not correlate highly within their respective category, and therefore we will feed these into the mixed linear model.

*variancePartition results for BrainSeq. The plot on the right shows the impact of the covariates of interest on randomly selected genes, whereas the plot on the left shows the overall impact of the covariates on the expression of all genes.*

BrainSpan

We removed lowly expressed genes prior to running variancePartition. The same cut-off was applied to BrainSpan as to BrainSeq: genes that did not reach 1 RPKM in at least 10% of all samples were removed, reducing the expression matrix from 52,379 to 19,671 genes.

*Density of genes in BrainSpan with respect to expression before and after filtering lowly expressed genes*

The selected metadata variables for BrainSpan include AgeNumeric, RIN, mRIN, pH, PMI, StructureAcronym, Regions and Period.

*Assessing correlation between covariates of interest from the BrainSpan data*

*variancePartition results for BrainSpan. The plot on the right shows the impact of the covariates of interest on randomly selected genes, whereas the plot on the left shows the overall impact of the covariates on the expression of all genes.*

GTEx

Lowly expressed genes were removed from the GTEx expression matrix: genes with TPM > 1 in fewer than 10% of the samples were filtered out, reducing the matrix from 56,200 to 20,849 genes.

*Assessing correlation between covariates of interest from the GTEx data*

PsychEncode

*Assessing correlation between covariates of interest from the PsychEncode data*

*Assessing correlation between covariates of interest from the PsychEncode data*
